Contingency table review
In this chapter you’ll continue working with the comics dataset introduced in the video. This is a collection of characteristics on all of the superheroes created by Marvel and DC comics in the last 80 years.
Let’s start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.
library(readr)
comics <- read_csv("_data/comics.csv", col_types = "ffffffffiff")
# Print the first rows of the data
comics
## # A tibble: 23,272 x 11
## name id align eye hair gender gsm alive appearances first_appear
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int> <fct>
## 1 "Spi~ Secr~ Good Haze~ Brow~ Male <NA> Livi~ 4043 Aug-62
## 2 "Cap~ Publ~ Good Blue~ Whit~ Male <NA> Livi~ 3360 Mar-41
## 3 "Wol~ Publ~ Neut~ Blue~ Blac~ Male <NA> Livi~ 3061 Oct-74
## 4 "Iro~ Publ~ Good Blue~ Blac~ Male <NA> Livi~ 2961 Mar-63
## 5 "Tho~ No D~ Good Blue~ Blon~ Male <NA> Livi~ 2258 Nov-50
## 6 "Ben~ Publ~ Good Blue~ No H~ Male <NA> Livi~ 2255 Nov-61
## 7 "Ree~ Publ~ Good Brow~ Brow~ Male <NA> Livi~ 2072 Nov-61
## 8 "Hul~ Publ~ Good Brow~ Brow~ Male <NA> Livi~ 2017 May-62
## 9 "Sco~ Publ~ Neut~ Brow~ Brow~ Male <NA> Livi~ 1955 Sep-63
## 10 "Jon~ Publ~ Good Blue~ Blon~ Male <NA> Livi~ 1934 Nov-61
## # ... with 23,262 more rows, and 1 more variable: publisher <fct>
# Check levels of align
levels(comics$align)
## [1] "Good" "Neutral" "Bad"
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Male" "Female" "Other"
# Create a 2-way contingency table
table(comics$align, comics$gender)
##
## Male Female Other
## Good 4809 2490 17
## Neutral 1799 836 17
## Bad 7561 1573 32
## Reformed Criminals 2 1 0
Dropping levels
The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.
In R, this requires two steps: first filtering out any rows with the levels that have very low counts, then removing these levels from the factor variable with droplevels(). This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don’t exist in a dataset.
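To see why both steps are needed, here is a minimal sketch on a small made-up factor (the vector x is hypothetical and not part of comics). It uses base R only and shows that droplevels() leaves a rare level alone until no rows use it.

```r
# Hypothetical factor with one rare level
x <- factor(c("Good", "Good", "Bad", "Bad", "Bad", "Reformed"))

# droplevels() alone keeps "Reformed" because one row still uses it
levels(droplevels(x))

# Step 1: filter out the rare level; the level itself is still attached
x_filtered <- x[x != "Reformed"]
levels(x_filtered)

# Step 2: drop the now-unused level
x_clean <- droplevels(x_filtered)
levels(x_clean)
```

Only the last call returns a level set without "Reformed".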
# Load dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Print tab
tab <- table(comics$align, comics$gender)
tab
##
## Male Female Other
## Good 4809 2490 17
## Neutral 1799 836 17
## Bad 7561 1573 32
## Reformed Criminals 2 1 0
# Remove align level
comics_filtered <- comics %>%
filter(align != "Reformed Criminals") %>%
droplevels()
# See the result
comics_filtered
## # A tibble: 19,856 x 11
## name id align eye hair gender gsm alive appearances first_appear
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int> <fct>
## 1 "Spi~ Secr~ Good Haze~ Brow~ Male <NA> Livi~ 4043 Aug-62
## 2 "Cap~ Publ~ Good Blue~ Whit~ Male <NA> Livi~ 3360 Mar-41
## 3 "Wol~ Publ~ Neut~ Blue~ Blac~ Male <NA> Livi~ 3061 Oct-74
## 4 "Iro~ Publ~ Good Blue~ Blac~ Male <NA> Livi~ 2961 Mar-63
## 5 "Tho~ No D~ Good Blue~ Blon~ Male <NA> Livi~ 2258 Nov-50
## 6 "Ben~ Publ~ Good Blue~ No H~ Male <NA> Livi~ 2255 Nov-61
## 7 "Ree~ Publ~ Good Brow~ Brow~ Male <NA> Livi~ 2072 Nov-61
## 8 "Hul~ Publ~ Good Brow~ Brow~ Male <NA> Livi~ 2017 May-62
## 9 "Sco~ Publ~ Neut~ Brow~ Brow~ Male <NA> Livi~ 1955 Sep-63
## 10 "Jon~ Publ~ Good Blue~ Blon~ Male <NA> Livi~ 1934 Nov-61
## # ... with 19,846 more rows, and 1 more variable: publisher <fct>
Side-by-side barcharts
While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.
Here you’ll construct two side-by-side barcharts of the comics data. This shows that there can often be two or more options for presenting the same data. Passing the argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not stacked) barchart.
# Load ggplot2
library(ggplot2)
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "dodge")
# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
Q: Approximately what proportion of all female characters are good?
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab) # Joint proportions
##
## Male Female Other
## Good 0.2512933 0.1301144 0.0008883
## Neutral 0.0940064 0.0436850 0.0008883
## Bad 0.3950985 0.0821968 0.0016722
## Reformed Criminals 0.0001045 0.0000523 0.0000000
prop.table(tab, 2) # Conditional on columns
##
## Male Female Other
## Good 0.339355 0.508163 0.257576
## Neutral 0.126949 0.170612 0.257576
## Bad 0.533554 0.321020 0.484848
## Reformed Criminals 0.000141 0.000204 0.000000
A: 51%
Nice! To answer this question, you needed to look at how align was distributed within each gender. That is, you wanted to condition on the gender variable.
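The second argument of prop.table() decides what you condition on. As a rough sketch, the block below rebuilds the contingency table from the counts printed earlier (typed in by hand, so treat the matrix as an assumption rather than the real comics data) and checks which margin sums to 1 in each case.

```r
# Counts copied from the contingency table above, one column per gender
tab <- matrix(c(4809, 1799, 7561, 2,
                2490,  836, 1573, 1,
                  17,   17,   32, 0),
              nrow = 4,
              dimnames = list(align = c("Good", "Neutral", "Bad", "Reformed Criminals"),
                              gender = c("Male", "Female", "Other")))

by_gender <- prop.table(tab, 2)  # condition on columns (gender)
by_align  <- prop.table(tab, 1)  # condition on rows (align)

colSums(by_gender)  # each gender column sums to 1
rowSums(by_align)   # each alignment row sums to 1

# Proportion of female characters who are good
round(by_gender["Good", "Female"], 2)
```

Conditioning on rows instead would answer a different question: what proportion of good characters are female.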
Counts vs. proportions (2)
Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two barcharts in this exercise: one of counts and one of proportions.
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill") +
ylab("proportion")
Excellent work! By adding position = "fill" to geom_bar(), you are saying you want the bars to fill the entire height of the plotting window, thus displaying proportions and not raw counts.
Marginal barchart
If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a barchart for just that single variable.
You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are "Neutral" show an alignment between "Good" and "Bad", so it makes sense to put that bar in the middle.
# Change the order of the levels in align
comics_filtered$align <- factor(comics_filtered$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics_filtered, aes(x = align)) +
geom_bar()
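Reordering with factor() changes the level order, not the data. The sketch below uses a short hypothetical vector to show this, along with one thing to watch for: any value missing from levels is silently turned into NA. That is one reason to do the reordering on comics_filtered, where "Reformed Criminals" has already been removed.

```r
# Hypothetical alignment values, including one level we do not list
align <- c("Good", "Bad", "Neutral", "Good", "Reformed Criminals")

# Only the listed levels are kept, in the order given
align_f <- factor(align, levels = c("Bad", "Neutral", "Good"))

levels(align_f)      # order used on the plot axis
table(align_f)       # counts follow the same order
sum(is.na(align_f))  # values outside `levels` have become NA
```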
Conditional barchart
Now, if you want to break down the distribution of alignment based on gender, you’re looking for conditional distributions.
You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender. As a point of comparison, we’ve provided your plot of the marginal distribution of alignment from the last exercise.
# Plot of alignment broken down by gender
ggplot(comics_filtered, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
Improve piechart
The piechart is a very common way to represent the distribution of a single categorical variable, but it can be more difficult to interpret than a barchart.
This is a piechart of a dataset called pies that contains the favorite pie flavors of 98 people. Improve the representation of these data by constructing a barchart that is ordered in descending order of count.
# Build the pies data; the column is named flavor to match the code below
pies <- data.frame(flavor = as.factor(rep(
c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"),
times = c(17, 14, 15, 13, 16, 12, 11))))
# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))
# Alternative solution to finding levels (requires dplyr)
# cnt <- count(pies, flavor)
# lev <- unlist(select(arrange(cnt, desc(n)), flavor))
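If you would rather not type the level order by hand, you can derive it from the counts. This is a minimal base R sketch, assuming the same flavor counts as the pies data above; the vector name flavor and the helper object flavor_f are just illustrative.

```r
# Same flavor counts as the pies data above
flavor <- rep(c("apple", "blueberry", "boston creme", "cherry",
                "key lime", "pumpkin", "strawberry"),
              times = c(17, 14, 15, 13, 16, 12, 11))

# Sort the frequency table, then reuse its names as the level order
lev <- names(sort(table(flavor), decreasing = TRUE))
flavor_f <- factor(flavor, levels = lev)
levels(flavor_f)
```

The resulting order matches the hand-written lev vector, so the bars come out in descending order of count.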
Faceted histogram
In this chapter, you’ll be working with the cars dataset, which records characteristics on all of the new models of cars for sale in the US in a certain year. You will investigate the distribution of mileage across a categorical variable, but before you get there, you’ll want to familiarize yourself with the dataset.
cars <- read.csv("_data/cars04.csv", stringsAsFactors = TRUE)
# Load package
library(ggplot2)
# Learn data structure
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).
In this exercise, you faceted by the suv variable, but it’s important to note that you can facet a plot by any categorical variable using facet_wrap(). Nice job!
Boxplots and density plots
The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you’ll try your hand at two alternatives: the box plot and the density plot.
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).
Marginal and conditional histograms
Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.
You’ll be making two plots using the “data pipeline” paradigm, where you start with the raw data and end with the plot.
# Create hist of horsepwr
cars %>%
ggplot(aes(x = horsepwr)) +
geom_histogram() +
ggtitle("Distribution of horsepower for all cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create hist of horsepwr for affordable cars
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(x = horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("Distribution of horsepower for affordable cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Three binwidths
Before you take these plots for granted, it’s a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It’s good practice to consider several binwidths in order to detect different types of structure in your data.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("binwidth = 3")
# Create hist of horsepwr with binwidth of 30
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("binwidth = 30")
# Create hist of horsepwr with binwidth of 60
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("binwidth = 60")
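To get a feel for what the binwidth does numerically, the sketch below counts observations per bin in base R. The horsepower values and the bin_counts() helper are hypothetical, and the bin edges are not guaranteed to match the ones ggplot2 picks; the point is only how the number of bins changes with the width.

```r
# Hypothetical horsepower values (not the cars data)
hp <- c(100, 103, 110, 130, 132, 140, 155, 170, 200, 225, 240, 300, 340, 500)

# Count observations per bin for a given binwidth
bin_counts <- function(x, binwidth) {
  lo <- floor(min(x) / binwidth) * binwidth
  hi <- ceiling(max(x) / binwidth) * binwidth
  breaks <- seq(lo, hi, by = binwidth)
  table(cut(x, breaks = breaks, include.lowest = TRUE))
}

length(bin_counts(hp, 3))   # many narrow bins, most of them empty
length(bin_counts(hp, 30))
length(bin_counts(hp, 60))  # a few wide bins
```

Narrow bins show every gap in the data, while wide bins merge neighbouring values and can hide separate modes.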
You cannot see the mode(s) of a distribution from a box plot.
Box plots for outliers
In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.
# Construct box plot of msrp
cars %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
cars_no_out <- cars %>%
filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
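The cutoff of 100000 above was chosen by eye from the plot. If you want the points that a box plot itself would flag, base R’s boxplot.stats() applies the usual 1.5 times IQR rule. The prices below are hypothetical (in thousands of dollars), so this is a sketch of the rule rather than a result for the cars data.

```r
# Hypothetical prices in thousands of dollars (not the cars data)
price <- c(12, 14, 15, 16, 18, 19, 21, 95)

# boxplot.stats() flags points beyond 1.5 times the box length from the box
stats <- boxplot.stats(price)
stats$out

# The same fences computed by hand from the hinges
hinges <- fivenum(price)[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)
price[price < fences[1] | price > fences[2]]
```

Both approaches flag only the value 95 here.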
Plot selection
Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.
# Create plot of city_mpg
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
# Create plot of width
cars %>%
ggplot(aes(x = width)) +
geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).
Great work! Because the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot.
Higher dimensional plots
3 variable plot
Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv) +
ggtitle("Mileage by suv and ncyl")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).
Calculate center measures
Throughout this chapter, you will use data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.
For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.
library(gapminder)
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fct> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
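The idea behind group_by() and summarize() is to split the data by group, apply a summary to each piece, and combine the results. As a point of comparison, here is the same idea in base R with tapply(). The data frame is hypothetical and its values are not taken from gapminder.

```r
# Hypothetical life expectancies (not the gapminder values)
df <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  lifeExp   = c(50, 55, 66, 76, 78, 80)
)

# Split lifeExp by continent, then apply the summary to each piece
tapply(df$lifeExp, df$continent, mean)
tapply(df$lifeExp, df$continent, median)
```

The dplyr version has the advantage of returning a data frame that can be piped straight into further steps.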
Calculate spread measures
Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
# Compute groupwise measures of spread
gap2007 %>%
group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Africa 9.63 11.6 52
## 2 Americas 4.44 4.63 25
## 3 Asia 7.96 10.2 33
## 4 Europe 2.98 4.78 30
## 5 Oceania 0.729 0.516 2
# Generate overlaid density plots
gap2007 %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
Choose measures for center and spread
Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them.
# Compute stats for lifeExp in Americas
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 x 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
# Compute stats for population
gap2007 %>%
summarize(median(pop),
IQR(pop))
## # A tibble: 1 x 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008.
Excellent! Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
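To make "robust" concrete, the sketch below compares the four measures on a small hypothetical sample, before and after adding a single extreme value. The numbers are made up for illustration.

```r
# Hypothetical life expectancies, with and without one extreme value
x <- c(70, 72, 74, 76, 78)
y <- c(x, 40)

c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
c(mean = mean(y), median = median(y), sd = sd(y), IQR = IQR(y))
```

The mean and standard deviation move much more than the median and IQR once the extreme value is included.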
Transformations
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.
Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).
# Create density plot of old variable
gap2007 %>%
ggplot(aes(x = pop)) +
geom_density()
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
mutate(log_pop = log(pop))
# Create density plot of new variable
gap2007 %>%
ggplot(aes(x = log_pop)) +
geom_density()
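One quick numeric check for right skew is to compare the mean with the median. The sketch below does this on hypothetical populations that each differ by a factor of ten; they are not the gapminder values.

```r
# Hypothetical populations spanning several orders of magnitude
pop <- c(1e5, 1e6, 1e7, 1e8, 1e9)

# On the raw scale the largest value dominates the mean
mean(pop) / median(pop)

# On the log scale the values are evenly spaced, so mean and median agree
log_pop <- log(pop)
all.equal(mean(log_pop), median(log_pop))
```

A mean far above the median on the raw scale, and close agreement after the transformation, is what you would expect when a log transformation removes most of the skew.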
Characteristics of a distribution
Identify outliers
Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
filter(continent == "Asia") %>%
mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
filter(!is_outlier) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
Spam and num_char
Is there an association between spam and the length of an email? You could imagine a story either way:
Here, you’ll use the email dataset to settle that question. Begin by bringing up the help file and learning about all the variables with ?email.
As you explore the association between spam and the length of an email, use this opportunity to try out linking a dplyr chain with the layers in a ggplot2 object.
# Load packages
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
library(dplyr)
email$spam <- factor(email$spam, labels = c("not-spam", "spam"))
# Compute summary statistics
email %>%
group_by(spam) %>%
summarize(median(num_char),
IQR(num_char))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## spam `median(num_char)` `IQR(num_char)`
## <fct> <dbl> <dbl>
## 1 not-spam 6.83 13.6
## 2 spam 1.05 2.82
# Create plot
email %>%
mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
Awesome job! You’ll interpret this plot in the next exercise.
Spam and !!!
Let’s look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.
Experiment with different types of plots until you find one that is the most informative. Recall that you’ve seen:
If you decide to use a log transformation, remember that log(0) is -Inf in R, which isn’t a very useful value! You can get around this by adding a small number (like 0.01) to the quantity inside the log() function. This way, your value is never zero. This small shift to the right has a negligible effect on the nonzero counts.
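Here is a quick illustration of the problem and the workaround on a hypothetical vector of counts. The last line shows log1p(), a base R function that computes log(1 + x); it is mentioned only as an alternative and is not what this chapter uses.

```r
counts <- c(0, 1, 5, 20)  # hypothetical exclamation mark counts

log(counts)         # the zero becomes -Inf
log(counts + 0.01)  # every value is now finite

# log1p(x) computes log(1 + x); an alternative not used in this chapter
log1p(counts)
```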
# Compute center and spread for exclaim_mess by spam
email %>%
group_by(spam) %>%
summarize(median(exclaim_mess),
IQR(exclaim_mess))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## spam `median(exclaim_mess)` `IQR(exclaim_mess)`
## <fct> <dbl> <dbl>
## 1 not-spam 1 5
## 2 spam 0 1
# Create plot for spam and exclaim_mess
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Alternative plot: side-by-side box plots
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
ggplot(aes(x = 1, y = log_exclaim_mess)) +
geom_boxplot() +
facet_wrap(~ spam)
# Alternative plot: Overlaid density plots
email %>%
mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
ggplot(aes(x = log_exclaim_mess, fill = spam)) +
geom_density(alpha = 0.3)
Collapsing levels
If it was difficult to work with the heavy skew of exclaim_mess, the number of images attached to each email (image) poses even more of a challenge. Run the following code at the console to get a sense of its distribution:
table(email$image)
Recall that this tabulates the number of cases in each category (so there were 3811 emails with 0 images, for example). Given the very low counts at the higher number of images, let’s collapse image into a categorical variable that indicates whether or not the email had at least one image. In this exercise, you’ll create this new variable and explore its association with spam.
table(email$image)
##
## 0 1 2 3 4 5 9 20
## 3811 76 17 11 2 2 1 1
# Create plot of proportion of spam by image
email %>%
mutate(has_image = image > 0) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
An email without an image is more likely to be not-spam than spam.
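To see how much the collapse simplifies things, the sketch below rebuilds the image counts from the table printed above (typed in by hand, so the vector stands in for email$image) and tabulates the new two-level variable.

```r
# Rebuild the image counts from the table above
image <- rep(c(0, 1, 2, 3, 4, 5, 9, 20),
             times = c(3811, 76, 17, 11, 2, 2, 1, 1))

# Collapse to a logical: did the email have at least one image?
has_image <- image > 0
table(has_image)
```

Eight sparse categories become two: 3811 emails without an image and 110 with at least one.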
Data Integrity
In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, the variable num_char contains the number of characters in the email, in thousands, so it could take decimal values, but it certainly shouldn’t take negative values.
You can formulate a test to ensure this variable is behaving as we expect:
email$num_char < 0
If you run this code at the console, you’ll get a long vector of logical values indicating for each case in the dataset whether that condition is TRUE. Here, the first 1000 values all appear to be FALSE. To verify that all of the cases indeed have non-negative values for num_char, we can take the sum of this vector:
sum(email$num_char < 0)
This is a handy shortcut. When you do arithmetic on logical values, R treats TRUE as 1 and FALSE as 0. Since the sum over the whole vector is zero, you learn that every case in the dataset took a value of FALSE in the test. That is, the num_char column is behaving as we expect and taking only non-negative values.
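The same trick works with other functions that do arithmetic on logicals. A small sketch on hypothetical values (not the email data):

```r
num_char <- c(0.5, 12.3, 4.0, 7.8)  # hypothetical values, in thousands

negative <- num_char < 0
sum(negative)       # number of cases failing the check
any(negative)       # TRUE if at least one case fails
mean(num_char > 5)  # proportion of cases above 5
```

sum() counts the TRUE values, any() answers the yes/no question directly, and mean() gives the proportion of TRUE values.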
# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0
Great work! Since image is never greater than attach, we can infer that images are counted as attachments.
Answering questions with chains
When you have a specific question about a dataset, you can find your way to an answer by carefully constructing the appropriate chain of R code. For example, consider the following question:
“Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?”
This can be answered with the following chain:
email %>%
filter(spam == "not-spam") %>%
group_by(to_multiple) %>%
summarize(median(num_char))
The code makes it clear that you are using num_char to measure the length of an email and median() as the measure of what is typical. If you run this code, you’ll learn that the answer to the question is “yes”: the typical length of non-spam sent to multiple people is a bit lower than that of non-spam sent to only one person.
This chain concluded with summary statistics, but others might end in a plot; it all depends on the question that you’re trying to answer.
# Question 1
email %>%
filter(dollar > 0) %>%
group_by(spam) %>%
summarize(median(dollar))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## spam `median(dollar)`
## <fct> <dbl>
## 1 not-spam 4
## 2 spam 2
# Question 2
email %>%
filter(dollar > 10) %>%
ggplot(aes(x = spam)) +
geom_bar()
What’s in a number?
Turn your attention to the variable called number. Read more about it by pulling up the help file with ?email.
To explore the association between this variable and spam, select and construct an informative plot. For illustrating relationships between categorical variables, you’ve seen
Let’s practice constructing a faceted barchart.
# Reorder levels
email$number_reordered <- factor(email$number, levels = c("none", "small", "big"))
# Construct plot of number_reordered
ggplot(email, aes(x = number_reordered)) +
geom_bar() +
facet_wrap(~ spam)